In this handout we will use the data analytic workflow/FACT framework to make predicitons about future sales. Our focus will be on data modeling. The steps and the details that we will focus on are listed next.
Let’s first load the packages that we will use, read in the data, and look at its structure.
library(tidyverse) # For lots of tidyverse packages and functions
library(magrittr) # For pipe symbol
library(GGally) # For nice looking pairplots
library(plotly) # For interactive plots like in Alteryx
# Read in data
df <- readRDS('adSpend.rds') # Fictitious data for illustrating regression concepts
str(df) # Structure
## 'data.frame': 426 obs. of 5 variables:
## $ sales : num 1744 2193 962 1988 2387 ...
## $ adSpendFacebook : num 50 82 13 63 77 70 31 53 63 11 ...
## $ adSpendInstagram: int 2 76 62 65 61 98 76 4 87 32 ...
## $ campaign : chr "C1" "C1" "C1" "C1" ...
## $ store : chr "A" "A" "A" "A" ...
The adspend data is fictitious data that we will use to illustrate regression concepts. It consists of 426 observations and 5 variables. Here is a brief description about each column:
Missing values:
colSums(is.na(df))
## sales adSpendFacebook adSpendInstagram campaign
## 0 0 0 0
## store
## 0
There are no missing values.
Let’s get an idea of how many unique values there are for each campaign and store.
summary(factor(df$campaign))
## C1 C2 C3
## 142 142 142
summary(factor(df$store))
## A B C D E
## 79 82 72 88 105
table(df$campaign, df$store)
##
## A B C D E
## C1 23 24 25 30 40
## C2 18 28 27 37 32
## C3 38 30 20 21 33
Let’s look at a visual representation of the data using a pairplot.
The code cell below creates a really useful set of plots for a dataset that is as small as this. We are only using the first four columns of data: sales, adSPendFacebook, adSpendInstagram, campaign. The store column will not be included in this plot, but we could re-run it by setting color equal to store.
# The ggpairs function is from the GGally package. It creates a rich pairplot
ggpairs(df
, columns = 1:4 # Only include the first four columns,and leave out the last column, store.
, aes(color = campaign # Let's look for any difference among campaign.
, alpha = .4 # Makes the colors partially transparent
)
)
We will now illustrate steps 3-4 using the campaign 1.
Let’s first create a scatter plot between adspend and sales for the first campaign.
# Create a dataset only for the C1 campaign data
c1 <- df %>% filter(campaign == 'C1')
# Reshape the data from wide to long and then plot it
c1 %>%
pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
, names_to = 'ad', values_to = 'spend') %>%
ggplot(aes(x = spend, y = sales, color = ad)) +
geom_point()
This highlights the strong positive correlation between sales and adSpendFacebook for campaign 1.
Let’s use regression and use a backward elimination approach to creating a model that we can use for predicting sales from campaign 1.
Backward elimination is when you add in all the variables and then remove those that are insignificant.
lmc1_1 <- lm('sales ~ adSpendFacebook + adSpendInstagram + store', data = c1)
summary(lmc1_1)
##
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram + store",
## data = c1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -628.43 -109.58 6.38 116.52 539.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 787.3297 61.3859 12.826 <2e-16 ***
## adSpendFacebook 18.5124 0.6219 29.768 <2e-16 ***
## adSpendInstagram 0.5810 0.5852 0.993 0.323
## storeB -52.5726 59.4661 -0.884 0.378
## storeC -41.7864 58.7861 -0.711 0.478
## storeD 31.1449 56.3893 0.552 0.582
## storeE -23.4717 53.3030 -0.440 0.660
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.4 on 135 degrees of freedom
## Multiple R-squared: 0.8707, Adjusted R-squared: 0.865
## F-statistic: 151.5 on 6 and 135 DF, p-value: < 2.2e-16
Observations
Let’s represent the insignificance of the adSpendInstagram and store variables.
# Reshape the data from wide to long and then plot it
c1 %>%
pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
, names_to = 'Platform', values_to = 'Spend') %>%
ggplot(aes(x = Spend, y = sales, color = Platform)) +
geom_point() +
stat_smooth(method = 'lm', se = F) +
facet_wrap(vars(store), ncol = 2)
Observations
* The adSpendInstagram line is pretty flat in most of the plots. This
means that it stays the same regardless of how much is spent, and is
therefore insignificant.
* The intercepts for adSpendFacebook are all about the same. There is a
little variation in intercepts for the adSpendInstagram lines. This is
why the store variable is insignificant.
Let’s create a simpler regression model by only including adSpendFacebook as a predictor variable.
lmc1_2 <- lm('sales ~ adSpendFacebook', data = c1)
summary(lmc1_2)
##
## Call:
## lm(formula = "sales ~ adSpendFacebook", data = c1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -585.83 -119.05 13.93 130.50 546.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 799.2518 39.9355 20.01 <2e-16 ***
## adSpendFacebook 18.5607 0.6142 30.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 202.6 on 140 degrees of freedom
## Multiple R-squared: 0.8671, Adjusted R-squared: 0.8661
## F-statistic: 913.1 on 1 and 140 DF, p-value: < 2.2e-16
Observations
Let’s make predictions with this model using the prediction data, and then visualize them in a 3d scatter plot using the plotly package.
# Read in the prediction data
pd <- readRDS('predictionData.rds') # Explore it by clicking on the pd dataframe in the Environment pane.
# Create a new column of predicted sales using the lmsa2 model
pd$sales_pred <- predict(lmc1_2, pd)
# Create an interactive 3d scatter plot
plot_ly(data = pd
, x = ~adSpendFacebook, y = ~adSpendInstagram, z = ~sales_pred
, color = ~store
, size = .5
, type = 'scatter3d'
, mode = 'markers'
)
Observations
Repeat this process with campaign 2, but using forward selection this time.
Let’s first create a scatter plot between adspend and sales for campaign 2.
# Create a dataset only for the se campaign data
c2 <- df %>% filter(campaign == 'C2')
# Reshape the data from wide to long and then plot it
c2 %>%
pivot_longer(cols = c(adSpendFacebook, adSpendInstagram)
, names_to = 'Platform', values_to = 'Spend') %>%
ggplot(aes(x = Spend, y = sales, color = Platform)) +
geom_point()
It’s hard to tell if either variable has a strong relationship with sales.
I will use forward selection.
lmc2_1 <- lm('sales ~ adSpendFacebook', data = c2)
summary(lmc2_1)
##
## Call:
## lm(formula = "sales ~ adSpendFacebook", data = c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1318.38 -437.49 7.31 352.52 1411.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1515.677 98.154 15.442 < 2e-16 ***
## adSpendFacebook 6.467 1.710 3.783 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 606.1 on 140 degrees of freedom
## Multiple R-squared: 0.09273, Adjusted R-squared: 0.08625
## F-statistic: 14.31 on 1 and 140 DF, p-value: 0.0002291
Observations
* The coefficient on adSpendFacebook is positive. I’ll keep that
variable, and add in another.
lmc2_2 <- lm('sales ~ adSpendFacebook + adSpendInstagram', data = c2)
summary(lmc2_2)
##
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram", data = c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1016.46 -344.77 -72.54 389.57 1077.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 916.436 96.871 9.460 < 2e-16 ***
## adSpendFacebook 5.375 1.320 4.072 7.78e-05 ***
## adSpendInstagram 13.208 1.337 9.876 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 466.3 on 139 degrees of freedom
## Multiple R-squared: 0.4668, Adjusted R-squared: 0.4592
## F-statistic: 60.86 on 2 and 139 DF, p-value: < 2.2e-16
Observations
* Both of the coefficients on adSpendFacebook and adSpendInstagram are
significant. I’ll keep those and add in store.
lmc2_3 <- lm('sales ~ adSpendFacebook + adSpendInstagram + store', data = c2)
summary(lmc2_3)
##
## Call:
## lm(formula = "sales ~ adSpendFacebook + adSpendInstagram + store",
## data = c2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -508.38 -123.83 3.79 126.90 517.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 735.0654 65.5871 11.207 < 2e-16 ***
## adSpendFacebook 5.2393 0.5703 9.188 6.32e-16 ***
## adSpendInstagram 14.0230 0.5810 24.135 < 2e-16 ***
## storeB -431.0324 60.9491 -7.072 7.47e-11 ***
## storeC 523.3755 60.8766 8.597 1.80e-14 ***
## storeD -84.1483 57.8160 -1.455 0.148
## storeE 688.4019 59.9669 11.480 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 199.6 on 135 degrees of freedom
## Multiple R-squared: 0.9051, Adjusted R-squared: 0.9009
## F-statistic: 214.6 on 6 and 135 DF, p-value: < 2.2e-16
Observations
Let’s make predictions with the lmc2_3 model using the prediction data, and then visualize them in a 3d scatter plot using the plotly package.
# Create a new column of predicted sales using the lmc2_3 model
pd$sales_pred_lmc2_3 <- predict(lmc2_3, pd)
# Create an interactive 3d scatter plot
plot_ly(data = pd
, x = ~adSpendFacebook, y = ~adSpendInstagram, z = ~sales_pred_lmc2_3
, color = ~store
, size = .5
, type = 'scatter3d'
, mode = 'markers'
)
Observations
Repeat this process with campaign 3. Use either backward elimination or forward selection.